The objective of this workshop is to create histograms, box plots, and time series plots using the ggplot2 package.
In this workshop we cover box plots, histograms, working with dates, and time series plots.
In this data visualisation workshop we will be building on the
concepts learnt in the first workshop, constructing visualisations using
the ggplot2 library.
We will be using one new package called lubridate, a tidyverse package designed to make working with dates and times easier; this will help us make time series visualisations. Run the code below to install lubridate.
# install lubridate
install.packages("lubridate")
Before we start we will need to load the libraries we will be using during this session. Run the code below to load your libraries.
# libraries we will be using
library(ggplot2)
library(dplyr)
library(lubridate)
library(readr)
library(janitor)
library(RColorBrewer)
Box plots are designed to compare the distribution of a continuous (numeric) variable across the groups of a categorical variable. They do this by displaying summary statistics of the continuous variable for each group.
The summary statistics shown are:
- the median
- the first and third quartiles (Q1 and Q3), which bound the interquartile range (IQR)
- the lower whisker limit: Q1 - 1.5*IQR
- the upper whisker limit: Q3 + 1.5*IQR
We will again use data from the Pokémon games for our box plot examples; the data was web scraped from https://pokemondb.net/pokedex/all.
# load and clean names
pokemon <- read_csv("https://raw.githubusercontent.com/andrewmoles2/webScraping/main/R/data/pokemon.csv") %>%
  clean_names()
# review data
pokemon %>%
  glimpse()
## Rows: 952
## Columns: 13
## $ number <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ name <chr> "Bulbasaur", "Ivysaur", "Venusaur", "Charmander", "Charmele…
## $ type1 <chr> "Grass", "Grass", "Grass", "Fire", "Fire", "Fire", "Water",…
## $ type2 <chr> "Poison", "Poison", "Poison", NA, NA, "Flying", NA, NA, NA,…
## $ total <dbl> 318, 405, 525, 309, 405, 534, 314, 405, 530, 195, 205, 395,…
## $ hp <dbl> 45, 60, 80, 39, 58, 78, 44, 59, 79, 45, 50, 60, 40, 45, 65,…
## $ attack <dbl> 49, 62, 82, 52, 64, 84, 48, 63, 83, 30, 20, 45, 35, 25, 90,…
## $ defense <dbl> 49, 63, 83, 43, 58, 78, 65, 80, 100, 35, 55, 50, 30, 50, 40…
## $ sp_atk <dbl> 65, 80, 100, 60, 80, 109, 50, 65, 85, 20, 25, 90, 20, 25, 4…
## $ sp_def <dbl> 65, 80, 100, 50, 65, 85, 64, 80, 105, 20, 25, 80, 20, 25, 8…
## $ speed <dbl> 45, 60, 80, 65, 80, 100, 43, 58, 78, 45, 30, 70, 50, 35, 75…
## $ legendary <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ generation <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
For these examples we will just look at one type of Pokémon, the electric type; the most famous of which is Pikachu! First, we extract just the electric type Pokémon, and make relevant columns factors.
# select columns to convert to factor
to_factor <- c("type1", "type2", "generation")
# extract just electric pokemon and make cols factors
electric_pokemon <- pokemon %>%
  filter(type1 == "Electric" | type2 == "Electric") %>%
  mutate(across(all_of(to_factor), factor))
head(electric_pokemon)
## # A tibble: 6 × 13
##   number name      type1    type2 total    hp attack defense sp_atk sp_def speed
##    <dbl> <chr>     <fct>    <fct> <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl> <dbl>
## 1     25 Pikachu   Electric <NA>    320    35     55      40     50     50    90
## 2     26 Raichu    Electric <NA>    485    60     90      55     90     80   110
## 3     81 Magnemite Electric Steel   325    25     35      70     95     55    45
## 4     82 Magneton  Electric Steel   465    50     60      95    120     70    70
## 5    100 Voltorb   Electric <NA>    330    40     30      50     55     55   100
## 6    101 Electrode Electric <NA>    490    60     50      70     80     80   150
## # … with 2 more variables: legendary <lgl>, generation <fct>
To make a box plot in ggplot we use the geom_boxplot()
geom function. One of our axis variables has to be categorical and the
other has to be numeric. In the below example we will use generation
(categorical) and total (numeric).
# generation by total
ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot()
From the output we see a few things. Each box has a line through the middle which indicates the median; the box itself spans the interquartile range. The lines above and below the boxes (known as whiskers) extend to the smallest and largest values that lie within 1.5*IQR of the box. The black dots are outliers, which fall beyond the whiskers.
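To make the pieces of a box plot concrete, the sketch below works them out by hand for a small vector of made-up numbers (not the Pokémon data). It assumes quartiles are taken with base R's quantile() at its default settings, which is our understanding of what geom_boxplot() does; treat it as an illustration rather than the exact internals.

```r
# Made-up values for illustration; 30 is deliberately extreme
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q1  <- quantile(x, 0.25, names = FALSE)  # first quartile (bottom of the box)
med <- median(x)                         # line through the box
q3  <- quantile(x, 0.75, names = FALSE)  # third quartile (top of the box)
iqr <- q3 - q1                           # height of the box

# Fences: points beyond these are drawn as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers stop at the most extreme values still inside the fences
whisker_low  <- min(x[x >= lower_fence])
whisker_high <- max(x[x <= upper_fence])
outliers     <- x[x < lower_fence | x > upper_fence]

c(q1 = q1, median = med, q3 = q3,
  whisker_low = whisker_low, whisker_high = whisker_high)
outliers
```

Here the upper whisker ends at 9 and the value 30 is reported as an outlier, because it lies more than 1.5*IQR above the third quartile.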
Just like with scatter and bar plots we can change the colours! You can use either fill or colour arguments with box plots, but fill tends to look better.
We will use the colour of Pikachu to colour our boxes. We used the Pokémon colour picker to get the colour of Pikachu: https://pokepalettes.com/#pikachu
ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652")
Sometimes it is useful to hide the outliers. To do so, add the outlier.shape = NA argument.
ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.shape = NA)
Displaying outliers is usually a good idea, so we will keep them for now and change their colour and shape. To adjust these we use the outlier.colour and outlier.shape arguments. We’ve used the colour of Pikachu’s cheeks as the outlier colour and made the shape a square.
ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.colour = "#c52018",
               outlier.shape = 15)
For the exercises in this workshop we will be using daily COVID data collected from most countries around the world.
The COVID data is from Our World in Data and is stored in a GitHub repository. More information on the data and what each variable means can be found here: https://github.com/owid/covid-19-data/tree/master/public/data
# load in covid data and select cases, deaths and vaccines
covid <- read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv") %>%
  select(iso_code:new_deaths_smoothed_per_million, contains("vaccin"),
         population, median_age, gdp_per_capita)
## Rows: 181452 Columns: 67
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): iso_code, continent, location, tests_units
## dbl (62): total_cases, new_cases, new_cases_smoothed, total_deaths, new_dea...
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# have a quick look at the data
covid %>% glimpse()
## Rows: 181,452
## Columns: 30
## $ iso_code <chr> "AFG", "AFG", "AFG", "AFG",…
## $ continent <chr> "Asia", "Asia", "Asia", "As…
## $ location <chr> "Afghanistan", "Afghanistan…
## $ date <date> 2020-02-24, 2020-02-25, 20…
## $ total_cases <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ new_cases <dbl> 5, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ new_cases_smoothed <dbl> NA, NA, NA, NA, NA, 0.714, …
## $ total_deaths <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ new_deaths <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ new_deaths_smoothed <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ total_cases_per_million <dbl> 0.126, 0.126, 0.126, 0.126,…
## $ new_cases_per_million <dbl> 0.126, 0.000, 0.000, 0.000,…
## $ new_cases_smoothed_per_million <dbl> NA, NA, NA, NA, NA, 0.018, …
## $ total_deaths_per_million <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ new_deaths_per_million <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ new_deaths_smoothed_per_million <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ total_vaccinations <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ people_vaccinated <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ people_fully_vaccinated <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ new_vaccinations <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ new_vaccinations_smoothed <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ total_vaccinations_per_hundred <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ people_vaccinated_per_hundred <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ people_fully_vaccinated_per_hundred <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ new_vaccinations_smoothed_per_million <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ new_people_vaccinated_smoothed <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ new_people_vaccinated_smoothed_per_hundred <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ population <dbl> 39835428, 39835428, 3983542…
## $ median_age <dbl> 18.6, 18.6, 18.6, 18.6, 18.…
## $ gdp_per_capita <dbl> 1803.987, 1803.987, 1803.98…
For this exercise we will make two box plots from our data, looking more at the demographics of each continent (we will look at cases and vaccines later).
Your two box plots should show the following:
Hint: you will have to remove the NA values from continent before plotting, e.g. covid %>% filter(!is.na(continent))
Hint: You can pipe from your filter function straight into ggplot2!
Hint: You can add colours in lots of ways but it can be fun to use a colour picker http://tristen.ca/hcl-picker/#/hlc/11/1.1/DC7261/D77357.
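As a sketch of the piping hint, the example below uses a tiny made-up data frame (toy, with hypothetical columns group and value) rather than the covid data. The point to notice is that the pipe (%>%) carries the filtered data into ggplot(), after which layers are added with + as usual.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: one row has a missing group
toy <- data.frame(
  group = c("a", "a", "a", "b", "b", "b", NA),
  value = c(1, 2, 3, 4, 5, 6, 100)
)

toy_box <- toy %>%
  filter(!is.na(group)) %>%            # drop rows with a missing group
  ggplot(aes(x = group, y = value)) +  # from here on use +, not %>%
  geom_boxplot(fill = "#DC7261")

toy_box
```

The same pattern should carry over to the exercise by swapping in the covid data and the columns you want to plot.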
# your code here
The main issue with box plots, as with bar plots, is that they can hide data. We can fix this by adding a scatter plot over the top of the boxes so we can see the full distribution of the data.
When adding a scatter plot, we won’t need the box plot’s outlier points, as the scatter plot will show these for us. We hide them using the outlier.shape = NA argument.
ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.shape = NA) +
  geom_point()
Some of our data points overlap, which makes it a little hard to see all the data. We can fix this by changing the position of our points using the position = "jitter" argument. We can also use geom_jitter(), which is shorthand for geom_point(position = "jitter"); we will use geom_jitter() going forward as it is less typing.
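By default, jittering nudges points a little in both directions, so the plotted y values are not exactly the recorded ones. If that matters, one option is to control the jitter with position_jitter(). The sketch below uses made-up data (toy is hypothetical) and spreads the points sideways only, with a seed so the plot looks the same each time it is drawn.

```r
library(ggplot2)

# Hypothetical toy data with many repeated values
toy <- data.frame(
  group = rep(c("a", "b"), each = 20),
  value = rep(c(1, 2, 3, 4), times = 10)
)

jitter_plot <- ggplot(toy, aes(x = group, y = value)) +
  geom_boxplot(outlier.shape = NA) +
  # spread points sideways only; the seed makes the jitter repeatable
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 1))

jitter_plot
```

The width and height arguments can also be passed straight to geom_jitter(), e.g. geom_jitter(width = 0.2, height = 0).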
# change position in geom_point
ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.shape = NA) +
  geom_point(position = "jitter")
# using geom_jitter
ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.shape = NA) +
  geom_jitter()
We can also add a colour grouping to our points to make them more meaningful. We add the colour aesthetic to our geom_jitter() function. In the example we colour our points by whether a Pokémon is legendary or not.
ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.shape = NA) +
  geom_jitter(aes(colour = legendary))
Finally we can change the colours of our points, which in this case we have done manually. Again, the colours were taken from the Pokémon colour picker for Pikachu: https://pokepalettes.com/#pikachu.
ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.shape = NA) +
  geom_jitter(aes(colour = legendary)) +
  scale_colour_manual(values = c("#c52018", "#41414a"))
Now we can add a title and save the plot! When saving the plot we have manually adjusted the width of the plot. You can also change the height.
electric_pokemon_box <- ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.shape = NA) +
  geom_jitter(aes(colour = legendary)) +
  scale_colour_manual(values = c("#c52018", "#41414a")) +
  labs(title = "Summary of electric pokemon for each generation") +
  theme_bw()
electric_pokemon_box
ggsave("electric_pokemon_box.png", electric_pokemon_box,
       width = 5.5)
## Saving 5.5 x 5 in image
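As a sketch of setting both dimensions when saving, the example below gives ggsave() a width, a height, the units they are measured in, and a resolution. The plot and the file name are made up for illustration, and the file is written to a temporary folder so nothing in your project is overwritten.

```r
library(ggplot2)

# Hypothetical small plot using made-up data
p <- ggplot(data.frame(x = 1:3, y = c(2, 4, 3)), aes(x = x, y = y)) +
  geom_point()

# Write to a temporary file so nothing in your project is overwritten
out_file <- file.path(tempdir(), "example_plot.png")

# width and height are read in the given units; dpi sets the resolution
ggsave(out_file, plot = p, width = 14, height = 10, units = "cm", dpi = 300)

file.exists(out_file)
```

When width or height is left out, ggsave() falls back to the size of the current graphics device, which is why the output above reports the size it used.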
For this exercise we will look at vaccines! We will look at 10 countries to see the difference in vaccine distribution; 5 have low GDP and 5 have high GDP. The data will be pre-prepared for you. We have made a vector with the countries that have high and low GDP. Then we have filtered our covid data by this vector, and made location a factor whose levels follow the order of that vector.
Add the data points using geom_jitter().
Save your plots using ggsave(). You will need to assign the plots to a variable first.
# Make vector with low and high gdp countries
high_low_gdp <- c("Sierra Leone", "Ethiopia", "Yemen",
                  "Zambia", "Nepal", "Sweden", "Australia",
                  "Saudi Arabia", "Germany", "United Kingdom")
# Only include locations in high_low_gdp
# Make location a factor, ordered by high_low_gdp
covid_select_countries <- covid %>%
  filter(location %in% high_low_gdp) %>%
  mutate(location = factor(location, levels = high_low_gdp))
# your code here
Histograms are great for visualising the distribution of numeric data. A histogram takes one numerical variable as its input.
To make a histogram with ggplot we provide a numerical variable to our x axis, and use the geom_histogram() geom. In the example we are using all the Pokémon data and showing the distribution of the total column.
ggplot(pokemon, aes(x = total)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can adjust the size of the bins in two ways: by changing the bin width or by choosing the number of bins. A bin is the interval of x axis values covered by each bar; the wider the bin, the more of the x axis, and so typically more data, each bar includes.
The first example uses binwidth. The number you provide is in the units of your x axis. In our example we are using the total column, which ranges from 175 to 754. If we set binwidth = 8, each bin spans 8 units of total. Run the two examples below with a smaller and larger binwidth to see the results.
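A quick way to get a feel for a bin width is to estimate how many bins it produces. The sketch below assumes the total column ranges from 175 to 754 and uses a small hypothetical helper, approx_bins(); ggplot may draw one bin more or fewer depending on where it places the bin edges, so read the results as rough figures.

```r
# Assumed range of the total column (175 to 754)
total_range <- c(175, 754)

# Hypothetical helper: rough number of bins needed to cover a range
approx_bins <- function(x_range, binwidth) {
  ceiling(diff(x_range) / binwidth)
}

narrow <- approx_bins(total_range, binwidth = 8)   # many narrow bins
wide   <- approx_bins(total_range, binwidth = 50)  # a few wide bins

c(binwidth_8 = narrow, binwidth_50 = wide)
```

So a bin width of 8 gives roughly 73 bars while a bin width of 50 gives roughly 12, which is why the first plot looks much more jagged than the second.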
# summary stats for total column
summary(pokemon$total)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   175.0   325.0   450.0   429.5   505.0   754.0
# binwidth of 8
ggplot(pokemon, aes(x = total)) +
  geom_histogram(binwidth = 8) +
  labs(title = "Small binwidth (8)")
# binwidth of 50
ggplot(pokemon, aes(x = total)) +
  geom_histogram(binwidth = 50) +
  labs(title = "Larger binwidth (50)")
The other method is to select the number of bins, using the bins argument. The more bins we use, the less data each bin contains. In the examples below we have bins with lots of data (bins = 10) and bins with less data (bins = 50). Which do you think is best?
# using 10 bins
ggplot(pokemon, aes(x = total)) +
  geom_histogram(bins = 10) +
  labs(title = "Less bins = more data in each bin")
# using 50 bins
ggplot(pokemon, aes(x = total)) +
  geom_histogram(bins = 50) +
  labs(title = "More bins = less data in each bin")
It can be helpful to colour your histogram by a categorical variable. As with box plots we use fill, but here it is mapped to a column inside aes(). In the example we have filled our histogram by the legendary category.
ggplot(pokemon, aes(x = total, fill = legendary)) +
  geom_histogram(binwidth = 20)
Another useful method is to use facets, which split up your data by a categorical variable and present the groups in a grid-like formation.
There are two ways to make facets in ggplot: facet_grid() or facet_wrap(). To use facet_grid() we define whether we want to display our data row-wise (rows =) or column-wise (cols =). When defining which column to split our data by we need to use the vars() function. See the two examples below on how to do a row or column facet grid.
# row-wise display
ggplot(pokemon, aes(x = total, fill = legendary)) +
  geom_histogram(binwidth = 20) +
  facet_grid(rows = vars(legendary)) +
  labs(title = "Row-wise facet grid")
# column-wise display
ggplot(pokemon, aes(x = total, fill = legendary)) +
  geom_histogram(binwidth = 20) +
  facet_grid(cols = vars(legendary)) +
  labs(title = "column-wise facet grid")
The other option is facet_wrap(), which by default only needs the column you want to split your data by. It also allows extra specification with the nrow and ncol arguments, which define how many rows and columns to display.
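When some groups have far fewer observations than others, their panels can look almost empty because every panel shares the same y axis. One option, sketched below, is the scales argument of facet_wrap(). The example uses mpg, a data set that ships with ggplot2, rather than the Pokémon data, so the column names are specific to that data set.

```r
library(ggplot2)

# mpg is an example data set that ships with ggplot2
free_facets <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2) +
  # each panel gets its own y axis; the x axis stays shared
  facet_wrap(vars(class), ncol = 4, scales = "free_y")

free_facets
```

Free scales make small groups easier to read, but bar heights can no longer be compared directly between panels, so it is a trade-off.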
In the examples below we show the default facet_wrap,
and how to adjust the column or row specification. We have used the
generation column as it has more groups.
# default facet_wrap
ggplot(pokemon, aes(x = total, fill = legendary)) +
  geom_histogram(binwidth = 20) +
  facet_wrap(vars(generation)) +
  labs(title = "Default facet wrap")
# 4 rows
ggplot(pokemon, aes(x = total, fill = legendary)) +
  geom_histogram(binwidth = 20) +
  facet_wrap(vars(generation),
             nrow = 4) +
  labs(title = "Facet wrap with 4 rows")
# 4 columns
ggplot(pokemon, aes(x = total, fill = legendary)) +
  geom_histogram(binwidth = 20) +
  facet_wrap(vars(generation),
             ncol = 4) +
  labs(title = "Facet wrap with 4 columns")
For this exercise we will make a histogram of the people_fully_vaccinated_per_hundred column for each continent.
Adjust the binwidth or bins (e.g. binwidth = 5 looks good).
Hint: you will have to remove the NA values from continent before plotting, e.g. covid %>% filter(!is.na(continent))
Hint: You can pipe from your filter function straight into ggplot2!
Hint: To change the fill colours you can use scale_fill_brewer(palette = "a palette")
Hint: Use brewer.pal.info to find RColorBrewer palettes
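As a sketch of how brewer.pal.info can be used, the example below lists the palettes and narrows them down to the qualitative ones, which tend to suit unordered categories such as continent. The choice of "Set2" is only an example.

```r
library(RColorBrewer)

# brewer.pal.info is a data frame with one row per palette
head(brewer.pal.info)

# Qualitative palettes suit unordered categories
qualitative <- subset(brewer.pal.info, category == "qual")
rownames(qualitative)

# Pull six colours from one of them
set2_colours <- brewer.pal(n = 6, name = "Set2")
set2_colours
```

Any of the palette names listed can be passed to scale_fill_brewer(palette = ...), provided the palette has at least as many colours as you have groups (see the maxcolors column).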
# your code here
Working with the date data type when programming can be a bit tricky for many reasons. There are different formats, time zones, and the challenge of extracting information from the date. Fortunately, the lubridate package comes to the rescue!
There are three types of date data: date (2010-09-01), time (15:08:52 BST), and date-time (2010-09-01 15:08:52 BST). For this workshop we will focus on the date type as it is the most common.
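To illustrate the point about different formats, the sketch below parses the same date written three ways using lubridate's ymd(), dmy() and mdy() functions, each named after the order of year, month and day in the text. The example strings are made up.

```r
library(lubridate)

# The same date written in three common formats
d1 <- ymd("2010-09-01")  # year-month-day
d2 <- dmy("01/09/2010")  # day-month-year
d3 <- mdy("09-01-2010")  # month-day-year

class(d1)
d1 == d2 & d2 == d3
```

All three give the same Date value, so once parsed they can be compared, sorted, and plotted in the same way.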
You can find out today’s date (more useful than it sounds) or the
date and time using the today() or now()
functions.
# make sure dplyr and lubridate are loaded
library(dplyr)
library(lubridate)
# get today's date
today()
## [1] "2022-04-25"
# today's date and time
now()
## [1] "2022-04-25 16:13:00 BST"
# make today's date a variable
today_date <- today()
A great feature of lubridate is extracting the year, month, day, or week day information from your date. We can test it out on today’s date. Run the code to see the output.
# year
year(today_date)
## [1] 2022
# month
month(today_date)
## [1] 4
month(today_date, label = TRUE)
## [1] Apr
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
# week
week(today_date)
## [1] 17
# day
day(today_date)
## [1] 25
# weekday
wday(today_date)
## [1] 2
wday(today_date, label = TRUE)
## [1] Mon
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
Notice that for the month and wday
functions we have the option to add labels. This can be very useful,
making your month or week day outputs more readable.
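In the output above, a Monday is reported as weekday 2 because lubridate counts from Sunday unless told otherwise. If you would rather count from Monday, wday() has a week_start argument, and abbr = FALSE gives full names instead of abbreviations. A small sketch using a fixed date (the variable name monday is just for illustration):

```r
library(lubridate)

# 2022-04-25 was a Monday
monday <- as.Date("2022-04-25")

sunday_first <- wday(monday, week_start = 7)  # Sunday counted as day 1
monday_first <- wday(monday, week_start = 1)  # Monday counted as day 1

c(sunday_first = sunday_first, monday_first = monday_first)

# abbr = FALSE gives the full name; the language depends on your locale
wday(monday, label = TRUE, abbr = FALSE)
```

With week_start = 1 the same Monday is reported as day 1, which can make weekday summaries easier to read if your week starts on a Monday.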
For the rest of the examples we will use some randomised, made-up data containing daily sleep and step information. Run the code below to see the data.
Note: to make this data we have used randomisation functions such as sample, runif and rnorm; if you are interested, look them up to see how they work.
# make some random data
df <- data.frame(
  date = seq(as.Date("2019-01-01"), as.Date("2021-12-01"), by = "days"),
  hours_sleep = round(rnorm(1066, mean = 9, sd = 1.5)),
  steps = round(rnorm(1066, mean = 8000, sd = 2000))
)
head(df)
##         date hours_sleep steps
## 1 2019-01-01           9 12131
## 2 2019-01-02           9 11735
## 3 2019-01-03          10  6775
## 4 2019-01-04           8 12866
## 5 2019-01-05          11  4078
## 6 2019-01-06           6  8431
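The number 1066 in the code above is the number of days between the two dates, inclusive. A slightly more robust variant, sketched below, builds the date sequence first and takes its length, so the random columns always match if the dates change. The names dates, n_days and df_example are our own, and the seed value is arbitrary; setting it means the made-up values will not match the output shown above.

```r
# Fix the random seed so the made-up values are the same on every run
set.seed(42)

dates <- seq(as.Date("2019-01-01"), as.Date("2021-12-01"), by = "days")
n_days <- length(dates)  # number of days in the sequence

df_example <- data.frame(
  date = dates,
  hours_sleep = round(rnorm(n_days, mean = 9, sd = 1.5)),
  steps = round(rnorm(n_days, mean = 8000, sd = 2000))
)

n_days
nrow(df_example)
```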
We can now use the mutate function to make a year,
month, week, day, and week day column.
df <- df %>%
  mutate(year = year(date),
         month = month(date, label = TRUE),
         week = week(date),
         day = day(date),
         week_day = wday(date, label = TRUE))
head(df)
##         date hours_sleep steps year month week day week_day
## 1 2019-01-01           9 12131 2019   Jan    1   1      Tue
## 2 2019-01-02           9 11735 2019   Jan    1   2      Wed
## 3 2019-01-03          10  6775 2019   Jan    1   3      Thu
## 4 2019-01-04           8 12866 2019   Jan    1   4      Fri
## 5 2019-01-05          11  4078 2019   Jan    1   5      Sat
## 6 2019-01-06           6  8431 2019   Jan    1   6      Sun
# see the breakdown of the date
df[1:2, c("date", "year", "month", "week", "day", "week_day")]
##         date year month week day week_day
## 1 2019-01-01 2019   Jan    1   1      Tue
## 2 2019-01-02 2019   Jan    1   2      Wed
Breaking the date down in this way allows us to do some aggregation of our data by the year, month, week, day, or weekday! In the examples below we have shown year and weekday.
# aggregate by year
df %>%
  group_by(year) %>%
  summarise(avg_sleep = mean(hours_sleep),
            avg_steps = mean(steps),
            total_steps = sum(steps))
## # A tibble: 3 × 4
##    year avg_sleep avg_steps total_steps
##   <dbl>     <dbl>     <dbl>       <dbl>
## 1  2019      9.09     8050.     2938146
## 2  2020      8.93     8042.     2943308
## 3  2021      9.16     7989.     2676161
# aggregate by week day
df %>%
  group_by(week_day) %>%
  summarise(avg_sleep = mean(hours_sleep),
            avg_steps = mean(steps),
            total_steps = sum(steps))
## # A tibble: 7 × 4
##   week_day avg_sleep avg_steps total_steps
##   <ord>        <dbl>     <dbl>       <dbl>
## 1 Sun           9.08     8049.     1223494
## 2 Mon           8.95     8036.     1221526
## 3 Tue           8.92     7903.     1209200
## 4 Wed           9.26     7783.     1190823
## 5 Thu           8.96     8244.     1253021
## 6 Fri           9.03     8331.     1266249
## 7 Sat           9.17     7851.     1193302
There are more functions in the lubridate package than we can cover in this session, so do have a look at the package website for more information (https://lubridate.tidyverse.org/index.html) and check out the R for Data Science chapter on dates (https://r4ds.had.co.nz/dates-and-times.html).
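One lubridate function worth knowing about is floor_date(), which rounds each date down to the start of a unit such as the month. The sketch below uses made-up daily data (daily, with a hypothetical value column) to show how this gives one row per calendar month while keeping a proper date column, which can be handy for time series plots.

```r
library(dplyr)
library(lubridate)

# Hypothetical daily data: a value of 1 for every day of 2020 and 2021
daily <- data.frame(
  date = seq(as.Date("2020-01-01"), as.Date("2021-12-31"), by = "days"),
  value = 1
)

monthly <- daily %>%
  # round each date down to the first of its month
  mutate(month_start = floor_date(date, unit = "month")) %>%
  group_by(month_start) %>%
  summarise(total = sum(value))

head(monthly)
```

Because every value is 1, the totals are simply the number of days in each month, which makes the grouping easy to check by eye.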
Using the examples above, extract year, month, day, day of week from covid data, and do an aggregation!
# your code here
# separate date column
covid <- covid %>%
  mutate(year = year(date),
         month = month(date, label = TRUE),
         week = week(date),
         day = day(date),
         week_day = wday(date, label = TRUE))
# make year and month aggregate
avg_year_month_covid <- covid %>%
  group_by(year, month) %>%
  summarise(
    avg_total_cases_per_mil = mean(total_cases_per_million, na.rm = TRUE),
    avg_total_deaths_per_mil = mean(total_deaths_per_million, na.rm = TRUE)
  )
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
avg_year_month_covid
## # A tibble: 28 × 4
## # Groups:   year [3]
##     year month avg_total_cases_per_mil avg_total_deaths_per_mil
##    <dbl> <ord>                   <dbl>                    <dbl>
##  1  2020 Jan                     0.642                   0.0318
##  2  2020 Feb                      3.27                    0.325
##  3  2020 Mar                      128.                     8.39
##  4  2020 Apr                      606.                     34.0
##  5  2020 May                     1078.                     58.9
##  6  2020 Jun                     1654.                     74.3
##  7  2020 Jul                     2414.                     92.9
##  8  2020 Aug                     3434.                     114.
##  9  2020 Sep                     4641.                     136.
## 10  2020 Oct                     6508.                     161.
## # … with 18 more rows
Time series plots visualise data over a period of time, which can be hourly, daily, weekly, monthly, or yearly. It is a great way to view trends over time. When plotting a time series, the x axis is the date and the y axis is your measure.
The simplest form of a time series visualisation in R uses an unedited date variable. Using our example data (df) we will visualise how steps have changed each day.
# daily time series
df %>%
  ggplot(aes(x = date, y = steps)) +
  geom_line()
As we can see, the number of steps taken each day is quite variable, as you might expect. There is a lot of data here so it is hard to see any real patterns; it just looks like noise! To solve this we can aggregate our data by year, month, or week to see if we can get any more insights.
For our example data it might be interesting to see the average number of steps taken each month, and to compare this year on year.
We first aggregate our data, grouping by the month and year columns we made with the lubridate package, find the average steps, and convert the year column into a factor to make plotting easier; month is already a factor.
# aggregated time series by month
monthly_steps <- df %>%
  group_by(month, year) %>%
  summarise(avg_steps = mean(steps)) %>%
  mutate(year = factor(year))
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
monthly_steps
## # A tibble: 36 × 3
## # Groups:   month [12]
##    month year  avg_steps
##    <ord> <fct>     <dbl>
##  1 Jan   2019      7899.
##  2 Jan   2020      8161.
##  3 Jan   2021      7986.
##  4 Feb   2019      7517.
##  5 Feb   2020      7836.
##  6 Feb   2021      8563.
##  7 Mar   2019      7848.
##  8 Mar   2020      8398.
##  9 Mar   2021      8217.
## 10 Apr   2019      8285.
## # … with 26 more rows
Now we can make a time series by month! It is often helpful when using geom_line() to pair it with geom_point() so we can see each data point clearly as well as the trends shown by the lines.
ggplot(monthly_steps,
       aes(x = month, y = avg_steps)) +
  geom_line() +
  geom_point()
That didn’t work as expected! Because month is a factor, ggplot groups the data by month by default, so the lines join the points within each month rather than across months. We need to use the group = argument to tell ggplot which points to connect.
By adding group = year our plot will now look like a time series, with one line per year; run the code to check it out.
ggplot(monthly_steps,
       aes(x = month, y = avg_steps,
           group = year)) +
  geom_line() +
  geom_point()
It would also be helpful to see what year each line represents. We add the colour = year argument as well to show this.
ggplot(monthly_steps,
       aes(x = month, y = avg_steps,
           group = year, colour = year)) +
  geom_line() +
  geom_point()
Our plot is still looking a little busy so we can use facets to split our data by year. We’ve used facet_wrap() here with 3 rows.
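ggplot(monthly_steps,
       aes(x = month, y = avg_steps,
           group = year, colour = year)) +
  geom_line() +
  geom_point() +
  facet_wrap(vars(year), nrow = 3)
Finally, we can make a few final adjustments and we have a nice visualisation that shows the average step count per month for the years 2019 to 2021. Below is a list of all the additions made to change the look of the plot:
- added a title and axis and legend labels with labs()
- changed the colour palette with scale_colour_brewer()
- changed the theme and font with theme_dark()
- set the y axis limits with scale_y_continuous()
- increased the size of the lines and points with the size = argument
step_count <- ggplot(monthly_steps,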
                     aes(x = month, y = avg_steps,
                         group = year, colour = year)) +
  geom_line(size = 2.5) +
  geom_point(size = 3) +
  facet_wrap(vars(year), nrow = 3) +
  labs(title = "Average step count per month for the year 2019 to 2021",
       x = "Month", y = "Average steps (mean)",
       colour = "Year") +
  scale_colour_brewer(palette = "Pastel2") +
  theme_dark(base_family = "Avenir") +
  scale_y_continuous(limits = c(7000, 9000))
step_count
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
ggsave("step_count.png", step_count, width = 9)
## Saving 9 x 5 in image
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Removed 1 rows containing missing values (geom_point).
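The "Removed 1 rows" warnings above most likely appear because one monthly average falls outside the limits given to scale_y_continuous(), and values outside scale limits are dropped before plotting. If you would rather zoom in without dropping anything, one option is coord_cartesian(). The sketch below uses a small made-up data frame (toy) with one value above 9000.

```r
library(ggplot2)

# Hypothetical monthly averages; one value lies above 9000
toy <- data.frame(
  month = factor(month.abb[1:4], levels = month.abb[1:4]),
  avg_steps = c(7500, 8200, 9100, 8000)
)

zoomed <- ggplot(toy, aes(x = month, y = avg_steps, group = 1)) +
  geom_line() +
  geom_point() +
  # zoom the view only; every row is kept, so no rows are removed
  coord_cartesian(ylim = c(7000, 9000))

zoomed
```

With this approach the line still runs towards the out-of-view point and is simply cut off at the edge of the panel, rather than leaving a gap.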
For this exercise we will be looking at the vaccine roll out for the United Kingdom, India, Nepal, Israel, Germany, and Australia. Each country has had a slightly different roll out, with Israel being the fastest. We will be looking at the week by week roll out for 2021.
Data preparation:
people_vaccinated_per_hundred column.
Assign the result back to weekly_vax
Plotting:
Using the weekly_vax data you have just prepared:
Use the people_vaccinated_per_hundred column as your y axis.
Use facets (facet_grid() or facet_wrap()).
Hint: if your x axis is looking squashed or cramped, try adding scale_x_discrete(guide = guide_axis(n.dodge = 2))
# your code here
We would be grateful if you could take a minute before the end of the workshop to give us your feedback!
The solutions will be available from a link at the end of the survey.
For the coding challenge we will look at other things you can do with ggplot2 such as making artwork! This is known as generative art, which is produced either in part or completely by automated processes.
Generative art is a complex topic, but some of the ideas and styles can be done using the aRtsy package, https://koenderks.github.io/aRtsy/, which makes generative art more accessible.
First, you will need to install the aRtsy package.
# install aRtsy
install.packages("aRtsy")
Then you will need to load it!
# load aRtsy
library(aRtsy)
When making generative art it is a good idea to make it reproducible, as there is a lot of randomisation involved. To make randomised results reproducible in R you set a seed: running the same code with the same seed gives the same results. We use the set.seed() function and pass in any number; that number is our seed. If we gave someone else our code and our seed they would be able to reproduce our results.
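A minimal sketch of what a seed does, using base R's runif() rather than aRtsy so it runs instantly; the variable names are just for illustration.

```r
# Same seed, same "random" numbers
set.seed(1)
first_run <- runif(3)

set.seed(1)
second_run <- runif(3)

identical(first_run, second_run)  # TRUE

# A different seed gives different numbers
set.seed(2)
third_run <- runif(3)

identical(first_run, third_run)   # FALSE
```

The same idea applies to the artwork functions: calling set.seed() with the same number immediately before the same function call should give the same picture.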
We’ve given some examples below on making a striped artwork and flow fields. Run the code chunk below, then try changing the seed to see how the results change when you run it again!
Note: these will take a few moments to run!
# set the seed to 1
set.seed(1)
# make a colour palette from rcolorbrewer
set1 <- brewer.pal(n = 9, name = "Set1")
pastel1 <- brewer.pal(n = 9, name = "Pastel1")
paired <- brewer.pal(n = 12, name = "Paired")
# test out different parameters for stripes
canvas_stripes(paired, n = 800, H = 5, burnin = 5)
canvas_stripes(pastel1, n = 500, H = 15, burnin = 2)
# Test out different parameters for flow fields
canvas_flow(set1, background = "#fafafa", lines = 800, lwd = 0.30,
            iterations = 80, stepmax = 0.15)
pastel_flow <- canvas_flow(pastel1, background = "black", lines = 2000, lwd = 0.15,
                           iterations = 30, stepmax = 0.10)
pastel_flow
# save pastel_flow
saveCanvas(pastel_flow, "pastel_flow.png")
Have a go yourself at making some generative art in R! Try out the following functions from aRtsy, changing the parameters to adjust the visualisation.
canvas_flow() https://koenderks.github.io/aRtsy/reference/canvas_flow.html
canvas_stripes() https://koenderks.github.io/aRtsy/reference/canvas_stripes.html
canvas_watercolors() https://koenderks.github.io/aRtsy/reference/canvas_watercolors.html
Don’t forget to save any artwork you like using the saveCanvas() function.
set.seed(1)
# your code here
The ggplot2 book is an excellent resource with lots of examples and exercises to have a go at: https://ggplot2-book.org/.
Cedric Scherer writes blogs and tutorials on ggplot2 on his website. Some of his content is really great and worth looking through.
Georgios Karamanis is a data visualisation designer and makes some amazing visualisations using R! It’s worth browsing his website for inspiration (https://karaman.is/) or following him on Twitter (https://twitter.com/geokaramanis).
For ideas about what to do with your data have a look at the R graph gallery https://www.r-graph-gallery.com/.